fix(core,cli): close tool_use↔tool_result invariant across all failure paths#4176
Conversation
…_use
Weak-network failures during an Anthropic-compatible stream (DeepSeek,
api.anthropic.com, etc.) can drop the SSE between a tool_use
`content_block_stop` and the terminal `message_stop`. The functionCall
chunk is already yielded at content_block_stop, so:
- Turn.run records a ToolCallRequest event.
- useGeminiStream's for-await exits and schedules the tool.
- handleCompletedTools eventually fires submitQuery(..., ToolResult)
and pushes a user[functionResponse] into history.
- But processStreamResponse's history.push for the model turn never
ran (the for-await threw first), so the matching tool_use is gone.
The next request body has `user → user[tool_result]` with no tool_use
in between, and the server rejects with HTTP 400:
"tool_use_id ... must have a corresponding tool_use block in the
previous message". Ctrl+Y can't recover because
stripOrphanedUserEntriesFromHistory only strips trailing user
entries — the lost tool_use is unrecoverable, and the session is
wedged.
Wrap the for-await loop in processStreamResponse with try/catch.
When the stream throws AND any functionCall chunk was already
yielded (hasToolCall=true), persist the partial assistant turn to
history before re-throwing. The eventual tool_result submission
then has a matching tool_use and the session can continue.
Plain-text partial turns (no functionCall yielded) are intentionally
NOT persisted: the Retry path pops the trailing user prompt and
re-issues it, so a stale partial-text model turn between them would
either bias the retry or surface as duplicate output.
📋 Review SummaryThis PR fixes a critical unrecoverable wedge that occurs when streaming connections drop mid-tool_use on weak networks. The fix ensures partial assistant turns containing tool calls are persisted to history before re-throwing stream errors, maintaining the tool_use/tool_result pairing invariant required by Anthropic-compatible APIs. The implementation is well-reasoned with comprehensive tests and excellent documentation. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
…ry fix Adds a third case to the partial-history persistence test suite for reasoning-mode providers (DeepSeek thinking, Claude 4.6+ adaptive): when the assistant turn streams a thinking block AND a tool_use before the SSE drops, the partial push must keep the thinking part before the functionCall so DeepSeek's `injectThinkingOnToolUseTurns` converter pass sees an existing block on the replayed turn and does not pre-pend a synthetic empty one (which would discard the model's original reasoning text).
There was a problem hiding this comment.
Pull request overview
Fixes a wedge state on weak networks where Anthropic-compatible streams (DeepSeek, Anthropic, etc.) drop after a tool_use content_block_stop but before message_stop. The yielded functionCall chunk causes downstream tool execution and a follow-up tool_result user turn, but no matching tool_use was ever pushed into history, leading to permanent 400s. The fix wraps the stream loop in processStreamResponse with try/catch and persists the partial assistant turn into history when a tool call has already been yielded before re-throwing.
Changes:
- Capture mid-stream errors in
processStreamResponseinstead of letting them abort post-loop processing. - When
hasToolCallis true on error, push the partial assistant turn (thinking + consolidated parts) into history before re-throwing, preserving tool_use/tool_result pairing. - Add three vitest cases covering tool-only, thinking+tool, and text-only mid-stream failures.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/core/geminiChat.ts | Adds try/catch around stream iteration and a recovery branch that conditionally persists the partial model turn before re-throwing. |
| packages/core/src/core/geminiChat.test.ts | Adds three new tests verifying partial-turn persistence with tool_use, with thinking+tool, and non-persistence for text-only failures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
yiliang114
left a comment
There was a problem hiding this comment.
The fix recovers the common SSE-drop case correctly. Left one non-blocking follow-up inline about closing the tool_use ↔ tool_result invariant at the failure point rather than handing off to the scheduler.
|
For reviewers — this PR is the minimal, surgical fix. The broader weak-network resilience story has two follow-ups filed so they can be reviewed independently:
Together (#4176 + #4177 + #4178) qwen-code's behavior on weak networks would be comparable to upstream Claude Code, addressing the same class of bug as |
Extends the partial-history fix in fe35e37 to cover the residual race paths surfaced in PR QwenLM#4176 review: - Race A: Ctrl+Y while in-flight tool hasn't finished. History is [user, model(tool_use)] — `stripOrphanedUserEntriesFromHistory` only pops trailing user entries, so the retry payload lands as a fresh user turn after the orphan tool_use and API rejects. Meanwhile the scheduler's `onAllToolCallsComplete` is single-shot and gated on `isResponding`, so the eventual tool_result is silently swallowed. - Race B: process crash / OOM / SIGKILL between the partial-tool_use push and the React scheduler's tool_result submission. On `--resume` the dangling model(tool_use) wedges the first API call. - Race C: external tooling / manual JSONL edits leaving the same dangling shape. The fix has three pieces working together: 1. `repairOrphanedToolUseTurns(history)` in geminiChat.ts walks history left-to-right and synthesizes an `error`-typed functionResponse for every functionCall whose id is not echoed back in the next user turn. Appends to an existing user turn when present, otherwise inserts a new one. Returns the injected (callId, name) list. 2. `GeminiClient.repairOrphanedToolUseTurnsInHistory()` wraps the helper and is called from three points: - `startChat()` after loading the transcript (Race B/C, --resume). - `sendMessageStream` Retry branch after stripOrphans (Race A). - `sendMessageStream` UserQuery/Cron branch (defensive belt-and- suspenders for anything that slipped past 1 and 2). 3. `handleCompletedTools` in useGeminiStream.ts dedupes against chat.history before submitting tool_results — if a synthetic functionResponse for the same callId is already present (planted by the repair pass), the in-flight scheduler's late result is dropped and the call is `markToolsAsSubmitted` so the UI advances. Same trade-off upstream Claude Code's `StreamingToolExecutor.discard()` makes — late real results are dropped on the wire after synthesis, the model sees the synthetic error and can retry the tool if it still wants the result. Together with the partial-history push from fe35e37, every tool_use that ever streamed to the consumer is guaranteed to have a matching tool_result on the wire — regardless of whether the stream errored, the user retried mid-flight, the process crashed, or the session is later resumed. This is the qwen-code analogue of upstream Claude Code's `yieldMissingToolResultBlocks` (query.ts:123-149), but split across the core/cli boundary because the React tool scheduler runs out-of-band from the stream loop (so the synthesis path can't atomically discard in-flight tools the way upstream's StreamingToolExecutor can; the history-dedup at handleCompletedTools fills that gap instead). Tests: - 8 new repair-helper tests in geminiChat.test.ts cover Race A, Race B, partial coverage of parallel tool_use, idempotence on already-paired history, no-op on tool-free history, caller- supplied reason text, multiple non-adjacent dangling rounds, and routes through the GeminiChat instance-method wrapper. - client.test.ts mocks updated for the new GeminiChat method. - All 88 geminiChat tests pass; 131 client tests pass; 91 useGeminiStream tests pass.
Post-review audit of 3de3241 surfaced two real races the earlier fix did not actually close: - Race A (the one yiliang114 originally flagged): handleCompletedTools is gated on `isResponding` (useGeminiStream:1971) BEFORE the dedup branch ran, so when the user Ctrl+Y'd mid-tool, the dedup never fired and the in-flight tool stayed permanently stuck in `completed-but-not-submitted` (the scheduler's `allToolCallsCompleteHandler` is single-shot). The dedup branch was structurally unreachable on the very race it was meant to defend against. - Race in the Retry path: the repair pass in `client.sendMessageStream` Retry branch ran BEFORE the chat-internal `push(userContent)`, so a Retry of a previous ToolResult submission (lastPrompt is a functionResponse part array) raced its own real `functionResponse` against the synthesized error one. The synthesis was winning and pre-empting the real result, producing two `functionResponse` entries with the same callId on the wire. This commit relocates both pieces to where the contract actually holds: 1. Move the dedup branch in `handleCompletedTools` to BEFORE the `isResponding` early-return so `markToolsAsSubmitted` always runs for callIds already paired in history, unblocking the UI/scheduler even when a new stream is in flight. The submission-side `geminiTools = completedAndReadyToSubmitTools.filter(... && !inHistory ...)` then covers the no-double-submit case in one expression. 2. Move the repair call from `client.sendMessageStream` Retry branch into `chat.sendMessageStream` immediately AFTER the user-supplied turn is pushed. The user's own tool_result (when the Retry payload carries one) gets the first chance to close the pair before the synthesizer sees it as dangling. `client.startChat()` keeps a belt-and-suspenders pass at session-load time so any pre-send code reading `chat.history` sees a well-formed shape. Adds three integration tests covering the new wiring: - geminiChat.test.ts: chat.sendMessageStream synthesizes a functionResponse when history carries a dangling tool_use AND the user-supplied content doesn't already close it (Race B/C plus Race A from the chat side). - geminiChat.test.ts: chat.sendMessageStream does NOT synthesize when the user-supplied content IS a matching functionResponse (the Retry-of-ToolResult race the previous wiring tripped on). - useGeminiStream.test.tsx: handleCompletedTools dedups a late real result whose callId already has a functionResponse in chat.history (Race A end-to-end: markToolsAsSubmitted fires but no submitQuery is dispatched). Test summary: 90/90 geminiChat, 131/131 client, 92/92 useGeminiStream; all 8186 core tests pass; tsc clean. Known limitation logged for follow-up: a Retry of ToolResult whose intervening stream itself partial-pushed a SECOND tool_use produces a trailing user turn with both the stale re-pushed `fr_A` and the synthesized `fr_B` for the second call — `fr_A` retry lands as an orphan because `fc_A` was already paired earlier in history. Triggered only when partial-push fires on the second stream AND user retries the tool_result rather than the user prompt; extremely low frequency in practice. Cleanest fix is in the retryLastPrompt layer (don't re-push an already-paired tool_result), which is out of scope for this PR.
wenshao
left a comment
There was a problem hiding this comment.
Review Summary: 2 files, +270/-36. Fixes unrecoverable wedge when streaming drops between tool_use content_block_stop and message_stop. Core logic is correct; 606 tests pass; tsc + eslint clean.
No high-confidence Critical issues found. 5 low-confidence suggestions identified (not posted as inline comments — needs human review):
- AbortSignal mid-tool_use persists partial turn — catch 块未区分 abort 与网络错误,取消后 history 含孤立 tool_use。建议加 守卫。
- Zero logging on partial turn persistence — 静默修改 history,调试困难。建议加 。
- recordAssistantTurn 在错误路径无条件调用 — 失败流的部分数据被记录为正常 turn。建议移入 块。
- Missing thinking-only test — 推理模型场景(thought:true 无 functionCall)无回归覆盖。
- Duplicate history.push 代码 — 错误/成功路径各一份相同实现,建议提取方法。
— DeepSeek/deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Review surfaced four issues on top of b2fed61. All addressed here: - [P0] Synthetic functionResponse was appended to the trailing user turn's parts; on a Ctrl+Y race the shape became `user[text, fr]` and Anthropic-compatible backends still 400 because tool_result blocks must come FIRST in a user message (Claude Code does the same hoist in utils/messages.ts:hoistToolResults). Repair now slots synthetic fr in before the first non-functionResponse part, after any pre-existing real fr parts (which preserves real-tool-result order on a parallel-partial-submit). Tests asserting the part order added. - [P0] The dedup test fixture in useGeminiStream.test.tsx forced an incompatible object through a direct cast to TrackedCompletedToolCall; `tsc` rejected it with TS2352. Fixture restructured to match the type's actual shape (matches the existing `Mocked` patterns at L597-606 / L2396-2401) so the changed-file typecheck stays green. - [P1] On stream error, `recordAssistantTurn` ran unconditionally (gated only on parts being non-empty), so partial text-only turns we intentionally do NOT push into in-memory history were still written to the chat-recording JSONL — leaving --resume's transcript-load path to re-inject a model turn the in-session run discarded. Gate now matches the `history.push` decision: record iff we will persist (success path OR stream-error-with-tool_use). - [P2] Added client-level tests for the startChat repair wiring: extraHistory ending in `model[functionCall]` triggers synthetic fr injection on init; happy-path resume is a no-op. Defends against a future reorder/removal of the call in startChat() regressing the --resume recovery path. All 91 geminiChat, 133 client, 92 useGeminiStream tests pass. Full core suite (8186 tests) green. Changed-file tsc clean except for the pre-existing `recordCompletedToolCall` typing gap that exists on main.
|
Addressed all review feedback in [P0] [P0] TS2352 in the dedup test fixture — My [P1] [P2] Test summary: 91/91 geminiChat, 133/133 client, 92/92 useGeminiStream; full core suite (8186 tests) green; changed-file Untouched (acknowledged but out-of-scope):
|
…wenLM#4176) One Suggestion thread from the deepseek-v4-pro pass on commit c30bba6. `MockedGeminiClientClass` did not expose `getHistoryFunctionResponseIds`, so 3 of the 4 dedup tests in `useGeminiStream.test.tsx` fell through to the `else if (typeof geminiClient.getHistory === 'function')` branch — the `structuredClone(this.history)` slow path. Only the mixed-batch test added in c30bba6 wired the fast path explicitly. Net effect: a regression in `getHistoryFunctionResponseIds` (wrong ids, missing ids, exception) would silently re-route every dedup batch onto the slow clone path with all 4 tests still green, while production paid the multi-millisecond UI-thread stall this PR specifically added the accessor to avoid. Fix in two parts: 1. Add `getHistoryFunctionResponseIds = vi.fn().mockReturnValue(new Set<string>())` to the `MockedGeminiClientClass` default. Every test now exposes the fast-path accessor, so the dispatcher takes the `getHistoryFunctionResponseIds` branch by default — matching production. Tests that need a non-empty dedup set override the mock explicitly. 2. Override the mock in each of the three previously-affected dedup tests with a `Set` containing the callId(s) their `getHistory()` fixture had paired. The dedup assertions stay identical — but the path the production code takes to reach them now matches the path under test. Tests: 96/96 useGeminiStream (no new tests; existing dedup tests now exercise the fast path the previous commit added). 353/353 across core + cli. tsc + eslint + prettier clean.
yiliang114
left a comment
There was a problem hiding this comment.
Reviewed the 8 commits since my last pass (2dbfc4e → 06a6951). The deferred-flush mechanism directly addresses the chat-recording inconsistency I flagged — design looks correct and the lifecycle pairing (set → pop on retry-success / flush on unretryable break) is clean. Hoist + duplicate-drop in the repair pass close real wedge paths I hadn't considered. A few observations:
P1 — repairOrphanedToolUseTurns complexity
The function grew from ~50 to ~200 lines with scan + hoist + dedup + splice + empty-turn cleanup all interleaved in one forward-walk loop. The allRemovalTargets sort-by-desc-then-splice approach is correct today, but the interleaving of scan and mutation phases makes it brittle for future additions. If another fix-up category lands here (e.g. multi-model parallel tool dispatch), I'd expect index-drift bugs.
Suggestion for a follow-up: refactor into scan phase (collect locations) → decision phase (classify synthesize/hoist/drop) → mutation phase (apply all changes). Not blocking, but flagging because this is the most complex single function in the repair subsystem now.
P1 — finally block swallows recording errors silently
try {
self.chatRecordingService?.recordAssistantTurn(...);
} catch (recordErr) {
debugLogger.warn('[PARTIAL_FLUSH] Failed to persist ...');
}If JSONL writes fail persistently (disk full, permission), every deferred partial is silently lost. The "eventual consistency" trade-off is fine for transient failures, but a sustained failure deserves an error-level log or an increment of a failure counter so monitoring can catch it.
P2 — Minor suggestions (non-blocking)
-
addHistorywarn log: the comment says "today this can't happen" — if it's a true invariant violation,debugLogger.errorwith "invariant violation" framing would be more appropriate thanwarn(which is easy to ignore in noisy logs). -
Three-branch duck-type fallback in
useGeminiStream: thegetHistory()fallback exists for test mocks, but all mocks now implementgetHistoryFunctionResponseIds. The middle branch is dead code in production. Consider removing it and enforcing the interface contract in tests. -
PR-thread references in code comments: phrases like "gpt-5.5 review thread on PR #4176" and "qwen-latest-series-invite-beta-v34 thread on PR #4176" are useful during review but become noise after merge. Worth a cleanup pass before landing — keep the "why" explanation, drop the reviewer/thread citations.
-
Test coverage gap: the original
should retry on TPM throttling StreamContentError with initial delayandshould use Retry-After delay for streamed rate-limit errorstests were fully replaced. The new tests cover the partial-rollback angle well, but the original assertions on delay timing and Retry-After header parsing don't have direct equivalents. If those behaviors are covered elsewhere, fine — otherwise it's a small regression in coverage.
Overall: the core fix is solid and the test matrix is thorough. The main concern is long-term maintainability of the repair function — everything else is polish.
…#4176) Five items from the yiliang114 review on commit 06a6951. Two P1 (observability + structural maintainability) and three editorial. [A] P1 — repairOrphanedToolUseTurns refactored into three pure phases: scanModelTurn / planRepair / applyRepair. Each runs in isolation against narrow `ScanResult` / `RepairPlan` types, and only the last mutates `history`. The outer forward-walk loop is now a ~25-line orchestrator: scan → bail on empty expected → plan → bail on empty plan → apply → record `injected` + `droppedDuplicates` → advance cursor past any freshly-inserted user turn. Index drift can only happen in applyRepair, so an audit narrows to one function. Behavior identical to the pre-refactor monolith — all 13 existing repair tests pass unchanged. [B] P1 — finally-block JSONL flush error log promoted from `debugLogger.warn` to `debugLogger.error`. A persistent write failure (disk full, permission denied, serialization broken) silently loses every deferred partial after the first occurrence; that's the class of failure that warrants monitoring attention, not the warn category that fades into log noise. Single-occurrence transient failures still surface as one error per occurrence. [C] P2 — addHistory partial-marker log promoted to `debugLogger.error` with `[INVARIANT_VIOLATION]` tag. The existing call graph cannot legitimately hit this branch (it only fires when addHistory runs mid-sendMessageStream, which no current caller does); any future hit is a true bug in a new caller, not noise. [D] P2 — removed the `getHistory()` fallback branch in `useGeminiStream.handleCompletedTools`. Verified via grep: every GeminiClient consumer (real production code + MockedGeminiClientClass + every individual dedup test) now exposes `getHistoryFunctionResponseIds`, so the three-branch duck-type dispatch was dead-code at production level and only existed to support an interim cycle when the fast-path accessor was being introduced. The dispatcher is now a one-liner; production path is strictly the fast accessor, with an empty Set returned if `geminiClient` itself is missing (only happens in hook-level unit tests with no client at all). [E] P2 — cleanup pass dropping reviewer/thread citations from code comments (e.g. "qwen-latest-series-invite-beta-v34 thread on PR QwenLM#4176", "gpt-5.5 review thread", "deepseek-v4-pro thread", "yiliang114 repro"). The "why" explanations stay — those are the load-bearing context. Reviewer/thread attribution lives in the PR history; embedding it in long-term comments was noise that ages poorly. Touched 9 sites in geminiChat.ts, 2 in client.ts, 1 in useGeminiStream.ts, 4 in useGeminiStream.test.tsx, and 8 in geminiChat.test.ts (test names + comment blocks). Also: P2.4 (yiliang114 flagged a potential test-coverage regression on TPM throttling + Retry-After delay) verified as a false alarm — `should retry on TPM throttling StreamContentError with initial delay` (geminiChat.test.ts:2959) and `should use Retry-After delay for streamed rate-limit errors` (line 3035) both still exist. Tests: 353/353 (no new tests; all existing pass through the refactored repair function and the simplified dispatcher). tsc + eslint + prettier clean.
yiliang114
left a comment
There was a problem hiding this comment.
LGTM. The three-phase refactor is clean, error-level promotions look right, and the dead fallback branch is gone. Ship it.
|
Comment density in The fix itself looks sound; this note is only about comment volume. Roughly 60% of the +918 lines in
Why I'd trim it:
To be clear, I'm not arguing against documenting the invariants: the bug is subtle and already caused a production wedge, so capturing the tool_use↔tool_result contract is well justified. It's the volume and the speculative future-proofing prose I'd push back on. Suggestion (fine as a follow-up, shouldn't gate the merge):
中文
修复本身没问题,这条只针对注释体量。
为什么建议精简:
需要说明:我并不反对记录这些不变量 —— 这个 bug 很微妙、且已造成过线上死锁,把 tool_use↔tool_result 的契约写下来完全合理。我想商榷的是注释的体量,以及那些防御性的"未来可能"长篇推演。 建议(作为后续提交即可,不应阻塞合并):
|
…ool-use-on-stream-error # Conflicts: # packages/core/src/core/geminiChat.ts
…ign block (PR QwenLM#4176) Addresses the pomelo-nwu review observation on c30bba6: ~60% of the +918 lines added to geminiChat.ts were comments, with several 20–40 line prose blocks repeating the same race-class / tool_use_id wedge analysis at every call site. Net –139 / +117 (net delete ~22 lines but mostly redistributing weight from per-site blocks to a single canonical note). Changes: * Add one canonical design block above ORPHAN_TOOL_USE_REPAIR_REASON covering: the tool_use_id 400 wedge, the three race classes (A Ctrl+Y mid-flight, B process crash, C SSE drop), the two-layer fix (partial-push + repair), and the partial-push marker lifecycle. Every per-site comment that used to repeat this now points back. * Trim popPartialIfPushed COMPRESSED-branch comment from 14 lines to 4 (no-op-today + reason for keeping). * Trim recovery-catch comment block from ~38 lines to ~7 (pop-order matters + index-checked, pointer to canonical note). * Trim addHistory invariant comment from ~22 lines to ~5 (when this fires + why error-level, pointer to canonical note). * Trim processStreamResponse partial-push comment from ~22 lines to ~8 (Race C name + signal + plain-text-not-persisted rule). * Replace tool_use_id wedge re-explanations at 5 sites (repair doc, scanModelTurn doc, applyRepair doc, popPartialIfPushed comment, processStreamResponse comment) with single-line pointers. Canonical block at line 411 is now the only authoritative copy. Reviewer marked this as non-blocking ("shouldn't gate the merge"); done as cleanup polish while LGTM is in. Tests: 375/375 (no test changes). tsc + eslint + prettier clean.
|
@pomelo-nwu thanks for the careful read — the cleanup landed in
Net The invariants you flagged as load-bearing (tool_use↔tool_result contract, marker lifecycle, hoist before tail-append) all kept their "why" — just lifted out of the per-site duplicates and into one canonical place so future drift hits one site instead of five. 中文@pomelo-nwu 感谢细读 —— 清理已经在
你标记为承重的几个不变量(tool_use↔tool_result 契约、marker 生命周期、hoist 必须在 head 而非 tail)都保留了 "why",只是从各处重复迁移到一处权威位置 —— 以后再漂移只会集中在一个地方而不是五处。 |
🧪 Local full-suite test report @ `298588190`Ran `npx vitest run packages/core packages/cli` locally on the latest PR head to back up the green CI check. Overall
PR-touched suites — all green
Covers everything this PR added: partial-push marker invariants × 6, duplicate-fr drop × 2, recovery-path pop ordering, escalation flush, transient-retry rollback, `getHistoryFunctionResponseIds` unit tests × 6, mixed-batch dedup fast-path assertion. The 1 failure — pre-existing on `main`, unrelated to this PR```
expected 'claude-cli/1.2.3 (external, cli)' to contain 'QwenCode/1.2.3' Verification: temporarily `git checkout origin/main -- packages/core/src/core/anthropicContentGenerator/` and re-ran that single case → still fails on main alone. This PR doesn't touch any `anthropicContentGenerator/*` files. The test was authored 2026-05-11 (`bdd5b602de`), and the regression appears to stem from `4b25f9c05` (`fix(core): set x-api-key alongside Authorization on Anthropic outbound (#4323)`) — expected the build-time-injected `QwenCode/1.2.3` but receives the hard-coded `claude-cli/...` UA. → Worth a separate follow-up issue for the Anthropic content generator maintainer; out of scope for #4176. ConclusionZero regressions from this PR across the full `packages/core` + `packages/cli` matrix (15,863/15,863 passing once you exclude the pre-existing unrelated fail). Safe to merge. 中文🧪 本地全量测试报告 @ `298588190`为 CI 绿灯背书,本地跑了 `npx vitest run packages/core packages/cli`。 整体:15,851 / 15,864 通过(1 fail / 12 skipped),耗时 286s。TSC core+cli、ESLint、Prettier(已改动文件)全部 clean。 PR 涉及的三个测试文件全绿:geminiChat.test.ts (123/123)、useGeminiStream.test.tsx (101/101)、client.test.ts (151/151) —— 覆盖了本 PR 新增的所有回归测试。 那 1 个失败与本 PR 无关,main 上已存在:`anthropicContentGenerator.test.ts:262` — 期望 `QwenCode/1.2.3` 但实际拿到 `claude-cli/...`。验证手段:把 `anthropicContentGenerator/` 临时 checkout 到 `origin/main` 后单独跑该 case 依然 fail。本 PR 没动该目录任何文件。可能源于 `#4323` 的 Anthropic 出站 header 改造。建议另起 follow-up issue。 结论:本 PR 在完整测试矩阵上零回归,可以放心 merge。 |
yiliang114
left a comment
There was a problem hiding this comment.
LGTM. All 38 review threads resolved, CI green on Linux/macOS, test coverage comprehensive (353+ tests). The three-phase repair refactor is clean and the partial-push lifecycle is well-guarded. Ship it.
…nLM#4176) Second cleanup pass on pomelo-nwu's feedback. Previous commit (2985881) cut net –22 lines but several blocks were still 20–25 lines of prose; this pass takes the rest down to the one-or-two-line pointer pattern the reviewer suggested. geminiChat.ts: 2384 → 2207 lines (-177). Comment density 37% → 32%. Trimmed: * pendingPartialAssistantTurnIndex + pendingPartialAssistantRecord field doc blocks (16 + 22 lines → one combined 5-line block pointing at the canonical note). * clearPendingPartialState method doc (10 → 4 lines). * sendMessageStream inline-repair comment block (29 → 7 lines). * `finally` JSONL flush block (32 → 8 lines). * repairOrphanedToolUseTurns function doc (39 → 14 lines). * scanModelTurn / planRepair / applyRepair phase docs (11–22 → 4–9 lines each). * RepairPlan interface doc (14 → 1 line). * GeminiChat.repairOrphanedToolUseTurns wrapper doc (9 → 3 lines). * GeminiChat.getHistoryFunctionResponseIds doc (13 → 6 lines). What stayed (intentional load-bearing context): * The canonical block above ORPHAN_TOOL_USE_REPAIR_REASON — that's the consolidation point every per-site one-liner points back to. * Pre-existing upstream JSDoc (ctor, sendMessageStream, getHistory, redactStructuredOutputArgsForRecording). * The Reactive compression no-op note (already at the 4-line cap). * The recovery-catch 7-line block (already at the cap). * The addHistory invariant 5-line block (already at the cap). Tests: 375/375 pass on the affected suites. tsc + eslint + prettier clean on changed files. Behavior unchanged.
|
Second pass on the comment-density feedback —
Trimmed (one-or-two-line pointer + canonical note reference):
What's left untouched is intentional: the canonical note (the consolidation point), pre-existing upstream JSDoc (ctor / `sendMessageStream` / `getHistory`), and the three blocks already at the cap (Reactive compression no-op = 4 lines, recovery catch = 7, addHistory invariant = 5). 375/375 still pass; tsc + eslint + prettier clean. 中文针对注释体量的第二轮清理 —— `e2033ec87`。第一次清理(`298588190`)把 wedge 分析集中到了一个 canonical block,但确实还留了几处 20–25 行的 per-site 大块没动;这次全部压成 1–2 行 pointer 的模式。 `geminiChat.ts`:2384 → 2207 行 (-177),注释密度 37% → 32%。 压缩明细(每处只留 1–2 行 + 指向 canonical note):
保留不动的几处是有意的:canonical note(集中说明的唯一权威位置)、上游 js-genai 原本就有的 JSDoc(ctor / `sendMessageStream` / `getHistory`)、以及已经压到上限的三处(Reactive compression no-op = 4 行、recovery catch = 7、addHistory 不变量 = 5)。 375/375 仍然通过,tsc + eslint + prettier 全绿。 |
|
Following this closely — really appreciate the depth of the root-cause writeup, and especially the explicit contrast with upstream's in-band Filed #4387 as an RFC to track the longer-term direction: moving tool dispatch from end-of-turn batched to stream-driven in-band, building on top of the invariant guarantees this PR establishes. Not asking for any action on this PR — the RFC is explicit that it builds on #4176 and intentionally avoids proposing anything that would conflict with the work in flight here. I'll revise the design notes once this lands. Happy to discuss in #4387 or here, whichever you prefer. |
BZ-D
left a comment
There was a problem hiding this comment.
Review
Solid fix for a genuinely tricky wedge. The 3-layer defense (partial push → repair pass → handleCompletedTools dedup) is well-chosen, and splitting repairOrphanedToolUseTurns into SCAN / DECISION / MUTATION phases makes it auditable. Test coverage at every failure point is impressive — particularly the regression guards for non-obvious invariants (dedup-before-isResponding, partial-push marker resets across all 6 history-mutation sites, escalation flush on finally).
A few non-blocking items worth considering:
1. Per-send repair pass cost — geminiChat.ts sendMessageStream
repairOrphanedToolUseTurns(this.history) runs on every send, walking the full history. For typical sessions this is microseconds, but on a 1000+ turn session with parallel tool_uses it could add up. Consider whether a "dirty bit" (e.g., set on stream error / setHistory / --resume load) is worth gating the pass on. Probably overkill given current usage, but worth a thought.
2. [PARTIAL_PUSH] log volume — geminiChat.ts partial-push branch
This warn fires on EVERY mid-stream tool_use error. On flaky networks (the exact use case this PR targets), it could spam logs. The other [REPAIR] warns are conditional on actual synthesis/drop happening, so they self-rate-limit; this one doesn't. Either rate-limiting or downgrading to info would help diagnostics stay signal-rich.
3. getHistoryFunctionResponseIds on public GeminiClient surface — client.ts:368
The method exists purely for the React hook's perf hot path. The JSDoc is clear about why, but it's a slight leak — the dispatcher contract (historyCallIdsWithResponse) is now spread across the core/cli boundary. A future refactor might consider wrapping it inside something like client.filterToolsForSubmission(...) so the cli side doesn't need to know about the dedup key shape. Not blocking.
4. Dead-code branch documented as "currently unreachable" — geminiChat.ts isContentError branch
The preserved popPartialIfPushed() call below the transient-stream retry. The comment is admirably honest and the rationale (defense-in-depth for a future error class that diverges the predicates) is sound. Some reviewers prefer to remove dead branches entirely and add them back when needed; others (myself included) think keeping the wiring with the WHY documented is fine. Your call.
5. Minor: comment density in tests
The WHY-style comments are appropriate for these subtle invariants and I'd rather have them than not, but a few blocks in the test file exceed the code they describe by 3–5×. Not asking to trim — just flagging that future readers may find the test file heavy to navigate.
Worth pinning explicitly
- The dedup MUST run before the
isRespondingearly-return — there's a dedicated regression test for this (runs Race A dedup BEFORE the isResponding early-return) which is exactly the right safeguard. Anyone refactoringhandleCompletedToolslater: do not move that block. - Every history-mutation method clearing the partial-push markers in lockstep is a load-bearing invariant. The
partial-push marker invariants on history mutationdescribe block covers all 6 sites — keep it that way if new mutation methods are added.
LGTM with the optional considerations above. No request-changes.
BZ-D
left a comment
There was a problem hiding this comment.
LGTM. 之前的 review 评论里有几条非阻塞建议可以后续考虑,但不影响合并。
Real-environment verification (weak-network drop)I ran an end-to-end weak-network reproduction against a real backend to check this PR's behavior, and want to share the data + an honest scope caveat. SetupReal CLI in tmux (dev mode, running the TS source) → a transparent SSE proxy that injects a mid-stream drop → dashscope Result — main and PR behave identically; neither wedges
Both versions recover cleanly on Ctrl+Y. Crucially, the retry request (#2) carries no dangling Why the OpenAI path doesn't wedgeOn the OpenAI path the pipeline holds the finish chunk and only yields the functionCall on the next chunk ( Honest scope / caveatI only exercised the OpenAI path. I did not reproduce the anthropic-protocol wedge — the scenario this PR actually fixes. That conclusion rests on code reading + the PR description + the upstream TakeawayThe fix addresses a real bug, but it appears anthropic-protocol-specific. OpenAI-compatible providers (incl. dashscope deepseek/qwen) don't hit this wedge at this drop point, so the "applies to openai-compatible providers too" framing may overstate the impact for those users. Since the change touches the shared core stream path for all providers, the unchecked manual weak-network test on the anthropic path is the highest-value item to complete before merge — it's the one scenario that demonstrates the fix's positive value. 中文(完整报告)验证目标复现 PR 描述的弱网 wedge:SSE 在 tool_use 的 环境与方法拓扑:
断流点精确性(代理 SSE 日志): 执行与证据main 版:断流 → 两版代理请求流逐行一致: 结论在 dashscope deepseek(OpenAI 协议)这个真实场景下,main 与 PR 行为完全一致:断流后 Ctrl+Y 重试都能干净恢复,均不产生 wedge。关键证据:两版重试请求的 messages 都没有悬空的 原因(协议差异):
限定(重要)只实测了 OpenAI 协议路径。anthropic 协议路径的 wedge 未实测——对它「确实会 wedge、且本 PR 能修复」的判断来自代码阅读 + PR 描述 + 上游 总结本 PR 修的是真 bug,但看起来是 anthropic 协议特有的;OpenAI 兼容后端(含 dashscope deepseek/qwen)在这个断流点不触发该 wedge。鉴于改动压在所有 provider 共享的核心流式路径上,那项未勾选的 anthropic 弱网手动测试是合并前最有价值的一环——它正是唯一能证明本 PR 正面价值的场景。 |
… for MCP related tools (QwenLM#4176) Co-authored-by: Jack Wotherspoon <jackwoth@google.com>
Summary
Fixes an unrecoverable wedge on weak-network connections (e.g. on a train) when calling DeepSeek-V4-Pro (or any Anthropic-compatible backend) through the Anthropic protocol. Same wedge reproduces under three additional failure modes surfaced in review; all closed in this PR.
Symptom — the model 400s with:
and pressing Ctrl+Y to retry yields the same error indefinitely. The chat session is permanently wedged.
Root cause — mid-stream
tool_usedropThe Anthropic-compatible SSE emits
tool_usecontent blocks ascontent_block_start→input_json_delta*→content_block_stop→message_delta→message_stop. InanthropicContentGenerator.processStream, thefunctionCallchunk is yielded atcontent_block_stop.When the SSE drops between
content_block_stopof a tool_use and the terminalmessage_stop(typical for flaky train/airplane WiFi):functionCallchunk has already been yielded →Turn.runregisters aToolCallRequestevent →useGeminiStreamschedules the tool after the for-await exits.processStreamResponse'sthis.history.push({role:'model'…})never runs — the for-await throws before reaching it.handleCompletedTools→submitQuery(…, ToolResult)→geminiChat.sendMessageStreampushesuser:[functionResponse]into history.[…, user:"query", user:[tool_result]]. The converter emits two consecutive user messages where the second carriestool_resultblocks but the preceding message has no matchingtool_use→ 400.stripOrphanedUserEntriesFromHistoryonly pops trailing user entries. The losttool_useis unrecoverable, and the chat is wedged for the rest of the session.Fixes
1. Persist partial assistant turn when stream errors mid
tool_useprocessStreamResponse(packages/core/src/core/geminiChat.ts) wraps its for-await in a try/catch. On stream error, if anyfunctionCallchunk was already yielded (hasToolCall === true) and we accumulated parts, persist the partial assistant turn into history before re-throwing — so the eventualtool_resultsubmission has its matchingtool_use.Plain-text partial turns (no
functionCall) are intentionally not persisted: the Retry path pops the trailing user prompt and re-issues it; a stale partial-text model turn between them would either bias the retry or surface as duplicate output.The shared
processStreamResponsecovers both Anthropic and OpenAI content generators, so the fix applies to all anthropic-compatible and openai-compatible providers — not just DeepSeek.2. Close
tool_use ↔ tool_resultinvariant at every failure pointSurfaced in @yiliang114's review: the partial-history push above relies on the React tool scheduler eventually producing a matching
tool_result. Three races break that contract:model[tool_use], not user, sostripOrphanedUserEntriesFromHistorydoesn't pop it. Retry pushes a fresh user turn after the orphan tool_use. Meanwhile the scheduler'sonAllToolCallsCompleteis single-shot AND gated onisResponding(useGeminiStream.ts:1971), so the eventual realtool_resultis silently swallowed.model[tool_use]wedges the first API call after--resume.Three pieces working together close every race:
repairOrphanedToolUseTurns(history)ingeminiChat.ts— walks history left-to-right and synthesizes anerror-typedfunctionResponsefor everyfunctionCallwhoseidisn't echoed back in the next user turn. Appends to an existing user turn when present, otherwise inserts a new one.client.startChat()after loading the transcript — covers Race B/C (--resume).chat.sendMessageStreamimmediately afterthis.history.push(userContent)— covers Race A in-session. The user-supplied turn gets the first chance to close the pair before the synthesizer sees it as dangling, so a Retry of a previous ToolResult submission (lastPrompt is afunctionResponse-parts array) lands the realfunctionResponseinstead of a synthetic error.handleCompletedTools(useGeminiStream.ts) — before submitting tool results, scanchat.historyfor anyfunctionResponse.idalready present (planted by the repair pass) and drop those toolCalls' responseParts from the wire payload while stillmarkToolsAsSubmittedso the UI/scheduler advances. The dedup runs before theisRespondingearly-return so the scheduler's single-shotonAllToolCallsCompletedoesn't leave the tool stuck incompleted-but-not-submittedwhen a Retry stream is in flight.This is the qwen-code analogue of Claude Code's
yieldMissingToolResultBlocks(query.ts:123-149), split across the core/cli boundary because our React scheduler runs out-of-band from the stream loop (upstream's in-bandStreamingToolExecutorcan atomically.discard()live tools at the synthesis point; we use history-dedup athandleCompletedToolsinstead to keep the in-flight scheduler's late real result from colliding with the synthesized error).Test plan
npx vitest run packages/core/src/core/geminiChat.test.ts— 90 tests pass (3 partial-history-push cases + 8 repair-helper unit cases + 2 chat.sendMessageStream integration cases)npx vitest run packages/core/src/core/client.test.ts— 131 tests pass (Retry path mocks updated for new method)npx vitest run packages/cli/src/ui/hooks/useGeminiStream.test.tsx— 92 tests pass (Race A end-to-end dedup integration test)npx vitest run --project '@qwen-code/qwen-code-core'— all 8186 core tests passtsc --noEmit -p packages/core/tsconfig.jsonclean--resumeof a session crashed mid-flight.Related
anthropics/claude-code#6836— upstream "tool_use/tool_result block mismatch" meta-issue with the same fix shape (~155 duplicates,oncalllabel, OPEN)Closes #4178